In this preliminary section, we’ll cover basic information that will help you to get started with RStudio.
If you haven’t already, please go ahead and install both the R and RStudio applications. R and RStudio must be installed separately; you should install R first, and then RStudio. The R application is a bare-bones computing environment that supports statistical computing using the R programming language; RStudio is a visually appealing, feature-rich, and user-friendly interface that allows users to interact with this environment in an intuitive way. Once you have both applications installed, you don’t need to open up R and RStudio separately; you only need to open and interact with RStudio (which will run R in the background).
The following subsections provide instructions on installing R and RStudio for the macOS and Windows operating systems. These instructions are taken from the “Setup” section of the Data Carpentry Course entitled R for Social Scientists. The Data Carpentry page also contains installation instructions for the Linux operating system; if you’re a Linux user, please refer to that page for instructions.
The Appendix to Garrett Grolemund’s book Hands-On Programming with R also provides an excellent overview of the R and RStudio installation process.
Now that we’ve installed and opened up RStudio, let’s familiarize ourselves with the RStudio interface. When we open up RStudio, we’ll see a window that looks something like this:
The RStudio Interface
If your interface doesn’t look exactly like this, it shouldn’t be a problem; we would expect to see minor cosmetic differences in the appearance of the interface across operating systems and computers (based on how they’re configured). However, you should see four distinct windows within the larger RStudio interface:
To open a new R script, click the File button on the RStudio menu bar, scroll down to the New File button, and then select R Script from the menu that opens up. Datasets can be inspected with the View() function, which will display the relevant data within a new tab in the Source window.
R is an open-source programming language for statistical computing that allows users to carry out a wide range of data analysis and visualization tasks (among other things). One of the big advantages of using R is that it has a very large user community among social scientists, statisticians, and digital humanists, who frequently publish R packages. One might think of packages as workbooks of sorts, which contain a well-integrated set of R functions, scripts, data, and documentation; these “workbooks” are designed to facilitate certain tasks or implement useful procedures. These packages are shared with the broader R user community, so anyone who needs to accomplish the tasks that a package addresses can use it in their own projects. The ability to use published packages considerably simplifies applied data research in R: we rarely have to write code entirely from scratch, and can instead build on the code that others have published as packages. This allows applied researchers to focus on substantive problems without getting bogged down in complicated programming tasks.
In this workshop, we will use the following packages to carry out relevant data analysis and visualization tasks (please click the relevant link to learn more about a given package; note that the tidyverse is not a single package, but rather an entire suite of packages used for common data science and analysis tasks):
+ tidyverse
+ wosr
To install a package in R, we can use the install.packages() function. A function is essentially a programming construct that takes a specified input (called an “argument”), runs this input through a set of procedures, and returns an output. In the code block below, the name of the package we want to install (here, the tidyverse suite) is enclosed within quotation marks and placed within parentheses after install.packages. Running the code below will download the tidyverse suite of packages to our computer:
# Installs the tidyverse suite of packages
install.packages("tidyverse")
To run this code in your own R session:
Below, we can see how that line of code should look in your script, and how to run it:
Installing tidyverse in R Script
Please note that you can follow along with the tutorial on your own computers by transferring all of the subsequent codeblocks into your script in just this way. Run each codeblock in your RStudio environment as you go, and you should be able to replicate the entire tutorial on your computer. You can copy-paste the workshop code if you wish, but we recommend actually retyping the code into your script, since this will help you to more effectively familiarize yourself with the process of writing code in R.
Note that the codeblocks in the tutorial usually begin with a comment, prefaced by a hash (“#”). When writing code in R (or any other programming language), it is good practice to preface one’s code with brief comments that describe what a block of code is doing. These comments allow someone else (or your future self) to read and understand the code more quickly than might otherwise be the case. The hash tells R that the rest of the line is a comment, and should be ignored when running a script. Without the hash, R would not know to ignore the comment; it would try to run the text as code, which would typically produce an error message.
Now, let’s install the other package we mentioned above, using the same install.packages() function:
# Installs the wosr package
install.packages("wosr")
All of the packages we need are now installed!
However, while our packages are installed, they are not yet ready to use. Before we can use our packages, we must load them into our environment. We can think of the process of loading installed packages into a current R environment as analogous to opening up an application on your phone or computer after it has been installed (even after an application has been installed, you can’t use it until you open it!). To load (i.e. “open”) an R package, we pass the name of the package we want to load as an argument to the library() function. For example, if we want to load our tidyverse packages into the current environment, we can type:
# Loads tidyverse packages into memory
library(tidyverse)
At this point, the full functionality of the tidyverse suite is available for us to use.
Now, let’s go ahead and load the other package that we’ll need:
# Loads the wosr package into memory
library(wosr)
At this point, the packages are loaded and ready to go! One important thing to note regarding the installation and loading of packages is that we only have to install a package once; after a package is installed, there is no need to reinstall it. However, we must load the packages we need (using the library() function) every time we open a new R session. In other words, if we were to close RStudio at this point and open it up later, we would not need to install these packages again, but we would need to load them again.
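Because installation only needs to happen once, some scripts check whether a package is already installed before calling install.packages(). The following is a minimal sketch of that idea and is not part of the workshop code. The helper name find_missing_packages is hypothetical; it relies on the base R function requireNamespace(), which returns FALSE when a package is not installed:

```r
# Hypothetical helper: returns the names of packages that are not yet installed
find_missing_packages <- function(package_names) {
  is_installed <- vapply(package_names, requireNamespace,
                         logical(1), quietly = TRUE)
  package_names[!is_installed]
}

# "stats" and "utils" ship with R, so neither should be reported as missing
find_missing_packages(c("stats", "utils"))

# For the workshop packages, one could then install only what is missing:
# missing_packages <- find_missing_packages(c("tidyverse", "wosr"))
# if (length(missing_packages) > 0) install.packages(missing_packages)
```

The installation lines are left commented out so that running the sketch does not download anything unless you choose to.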
Before we can get a sense of how to work with data in R, it is important to familiarize ourselves with basic features of the R language’s syntax, and the basic data structures that are used to store and process data.
At its most basic, R can be used as a calculator. For instance:
# calculates 2+2
2+2
## [1] 4
# calculates 65 to the power of 4
65^4
## [1] 17850625
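A few more calculator-style examples may be helpful here. The sketch below uses only operators and functions that are built into R: parentheses to control the order of operations, %/% for integer division, %% for the remainder, and sqrt() for square roots:

```r
# parentheses control the order of operations
(2+3)*4
## [1] 20
# integer division: how many times 5 goes into 17
17 %/% 5
## [1] 3
# remainder left over after dividing 17 by 5
17 %% 5
## [1] 2
# calculates the square root of 16
sqrt(16)
## [1] 4
```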
While this is a useful starting point, the possibility of assigning values to objects (or variables) considerably increases the scope of the operations we are able to carry out. We turn to object assignment in the next sub-section.
The concept of object (or variable) assignment is a fundamental concept when working in a scripting environment; indeed, the ability to easily assign values to objects is what allows us to easily and intuitively manipulate and process our data in a programmatic setting. To better understand the mechanics of object assignment, consider the following:
# assign value 5 to new object named x
x<-5
In the code above, we use R’s assignment operator, <-, to assign the value 5 to an object named x. Now that an object named x has been created and assigned the value 5, typing x in our console (or typing x in our script and running it) will print the value that has been assigned to the x object, i.e. 5:
# prints value assigned to "x"
x
## [1] 5
More generally, assignment binds the output created by the code on the right side of the assignment operator (<-) to an object whose name is specified on the left side. Whenever we want to look at the contents of an object (i.e. the output created by the code on the right side of the assignment operator), we simply type the name of the object in the R console (or type the name in a script and run it).
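To illustrate, in the sketch below the expression on the right side is evaluated first, and its result is then stored under the name on the left side. The object name hours_per_week is just a placeholder chosen for this example. In most cases, typing an object’s name at the console has the same effect as passing it to the print() function:

```r
# evaluates 24*7 and assigns the result to a new object named "hours_per_week"
hours_per_week<-24*7
# typing the name of the object prints its value
hours_per_week
## [1] 168
# calling print() explicitly displays the same value
print(hours_per_week)
## [1] 168
```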
Let’s create another object, named y, and assign it the value “12”:
# assign value 12 to new object named y
y<-12
As we noted above, we can print the value that was assigned to y by typing its name:
# prints value assigned to "y"
y
## [1] 12
It’s possible to use existing objects to assign values to new ones. For example, we can assign the sum of x and y to a new object that we’ll name xy_sum:
# creates a new object, named "xy_sum" whose value is the sum of "x" and "y"
xy_sum<-x+y
Now, let’s print the contents of xy_sum:
# prints contents of "xy_sum"
xy_sum
## [1] 17
As expected, we see that the value assigned to xy_sum is “17” (i.e. the sum of the values assigned to x and y).
It is possible to change the value assigned to a given object. For example, let’s say we want to change the value assigned to x from “5” to “8”:
# assign value of "8" to object named "x"
x<-8
We can now confirm that x is associated with the value “8”:
# prints updated value of "x"
x
## [1] 8
It’s worth noting that updating the value assigned to x will not automatically update the value assigned to xy_sum (which, recall, is the sum of x and y). If we print the value assigned to xy_sum, note that it is still “17”:
xy_sum
## [1] 17
In order for the value assigned to xy_sum to be updated with the new value of x, we must run the assignment operation again:
# assigns sum of "y" and newly updated value of "x" to "xy_sum" object
xy_sum<-x+y
Now, the value of xy_sum should reflect the updated value of x, which we can confirm by printing the value of xy_sum:
# prints value of "xy_sum"
xy_sum
## [1] 20
Note that the value assigned to xy_sum is now “20” (the sum of “8” and “12”), rather than “17” (the sum of “5” and “12”).
While the examples above were very simple, we can assign the output of virtually any R code (such as datasets, vectors, or graphs/plots) to an R object. When naming your objects, try to be descriptive, so that the name of the object signifies something about its corresponding value.
Below, consider a simple example of an object, named our_location, that has been assigned a non-numeric value. Its value is a string, i.e. textual information:
our_location<-"Boulder, CO"
We can print the string that has been assigned to the our_location object by typing the name of the object in our console, or running it from our script:
# prints value of "our_location" object
our_location
## [1] "Boulder, CO"
Note that generally speaking, you have a lot of flexibility in naming your R objects, but there are certain rules. For example, object names must start with a letter (or with a period that is not followed by a number), and can only contain letters, numbers, underscores, and periods. Also, object names cannot contain spaces; if you’d like to use multiple words or phrases, connect the discrete elements with an underscore (_), or use camel case (where each new word begins with a capital letter).
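One way to check whether a candidate name follows these rules is the base R function make.names(), which converts any string into a syntactically valid name; if the string comes back unchanged, it was already valid. The sketch below wraps this check in a helper named is_valid_name, which is a hypothetical name introduced here for illustration:

```r
# Hypothetical helper: returns TRUE if "candidate" is already a valid object name
is_valid_name <- function(candidate) {
  make.names(candidate) == candidate
}

# words connected with an underscore
is_valid_name("our_location")
## [1] TRUE
# camel case
is_valid_name("ourLocation")
## [1] TRUE
# contains a space
is_valid_name("our location")
## [1] FALSE
# starts with a number
is_valid_name("2nd_location")
## [1] FALSE
```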
It is also worth emphasizing that object names are case sensitive; in order to print the value assigned to an object, that object’s name must be typed exactly as it was created. For example, if we were to type our_Location, we would get an error, since there is no our_Location object (only an our_location object):
our_Location
## Error in eval(expr, envir, enclos): object 'our_Location' not found
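If you are unsure whether an object with a particular name has been created, the base R function exists() offers a way to check without triggering an error. A brief sketch, assuming that no object named our_Location (with an upper-case “L”) has been created in the current session:

```r
# creates the "our_location" object
our_location<-"Boulder, CO"
# checks for an object named "our_location" (lower-case "l")
exists("our_location")
## [1] TRUE
# checks for an object named "our_Location" (upper-case "L")
exists("our_Location")
## [1] FALSE
```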
We now turn to a brief overview of some important data structures that help us to work with data in R. We will consider three data structures that are particularly useful: vectors, data frames, and lists. Note that this is not an exhaustive treatment of data structures in R; there are other structures, such as matrices and arrays, that are also important. However, we will limit our discussion to the data structures that are essential for getting started with data-based research in R.
In R, a vector is a sequence of values of the same type. A vector is created using the c() function. For example, let’s make a vector with some arbitrary numeric values:
# makes vector with values 5,7,55,32
c(5, 7, 55, 32)
## [1] 5 7 55 32
If we plan to work with a numeric vector again later in our workflow, it makes sense to assign it to an object. Below, we assign a similar vector (this time with some decimal values) to an object, which we’ll call arbitrary_values:
# assigns vector of arbitrary values to new object named "arbitrary_values"
arbitrary_values<-c(5,7,55.6,32.5)
Now, whenever we want to print the vector assigned to the arbitrary_values object, we can simply type the name of the object:
# prints vector assigned to "arbitrary_values" object
arbitrary_values
## [1] 5.0 7.0 55.6 32.5
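Two base R functions are handy for inspecting a vector: length() reports how many elements it contains, and class() reports what kind of values it holds. The sketch below recreates arbitrary_values so that it can be run on its own. The mixed_values object is a hypothetical example, included to show that when numbers and strings are combined with c(), R converts every element to a string so that all elements share one type:

```r
# recreates the "arbitrary_values" vector
arbitrary_values<-c(5,7,55.6,32.5)
# counts the number of elements in "arbitrary_values"
length(arbitrary_values)
## [1] 4
# reports the type of values stored in "arbitrary_values"
class(arbitrary_values)
## [1] "numeric"
# combines a number and a string in a hypothetical "mixed_values" vector
mixed_values<-c(5, "seven")
# the number 5 has been converted to the string "5"
class(mixed_values)
## [1] "character"
```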
It is possible to carry out mathematical operations with numeric vectors. For instance, let’s say that we want to double the values in the arbitrary_values vector; to do so, we can simply multiply arbitrary_values by 2, which yields a new vector where each element is twice the corresponding element in arbitrary_values. Below, we’ll create this new vector, assign it to a new object named arbitrary_values_2x, and print its contents:
# creates a new vector that doubles the values in "arbitrary_values" and assigns it to a new object named "arbitrary_values_2x"
arbitrary_values_2x<-arbitrary_values*2
# prints contents of "arbitrary_values_2x"
arbitrary_values_2x
## [1] 10.0 14.0 111.2 65.0
Now, let’s say we want to add different vectors together; the code below creates a new vector by adding together arbitrary_values and arbitrary_values_2x:
arbitrary_values + arbitrary_values_2x
## [1] 15.0 21.0 166.8 97.5
Note that each element of the resulting vector printed above is the sum of the corresponding elements in arbitrary_values and arbitrary_values_2x.
Other arithmetic operations on numeric vectors are also possible, and you may wish to explore these on your own as an exercise.
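As a starting point for that exploration, here is a short sketch of a few other operations. Subtraction and division work element by element, just like the multiplication and addition above, while the built-in sum() and mean() functions collapse a vector into a single value. The vector is recreated here so that the example can be run on its own:

```r
# recreates the "arbitrary_values" vector
arbitrary_values<-c(5,7,55.6,32.5)
# subtracts 1 from each element
arbitrary_values-1
## [1]  4.0  6.0 54.6 31.5
# divides each element by 2
arbitrary_values/2
## [1]  2.50  3.50 27.80 16.25
# adds up all of the elements
sum(arbitrary_values)
## [1] 100.1
# calculates the average of the elements
mean(arbitrary_values)
## [1] 25.025
```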
In many cases, it is useful to extract a specific element from a vector. Each element in a given vector is assigned an index number, starting with 1; that is, the first element in a vector is assigned an index value of 1, the second element of a vector is assigned an index value of 2, and so on. We can use these index values to extract our desired vector elements. In particular, we can specify the desired index within square brackets after typing the name of the vector object of interest. For example, let’s say we want to extract the 3rd element of the vector in arbitrary_values. We can do so with the following:
# extracts third element of "arbitrary_values" vector
arbitrary_values[3]
## [1] 55.6
It is also possible to extract a range of values from a vector using index values. For example, let’s say we want to extract a new vector comprised of the second, third, and fourth numeric elements in arbitrary_values; we can do so with the following:
# extracts a new vector comprised of the 2nd, 3rd, and 4th elements of the existing "arbitrary_values" vector
arbitrary_values[2:4]
## [1] 7.0 55.6 32.5
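Index values do not have to form a continuous range. The sketch below shows three other indexing patterns that base R supports: passing a vector of positions created with c(), using a negative index to drop an element, and using a logical condition to keep only the elements that satisfy it. The vector is recreated so that the example can be run on its own:

```r
# recreates the "arbitrary_values" vector
arbitrary_values<-c(5,7,55.6,32.5)
# extracts the 1st and 4th elements
arbitrary_values[c(1,4)]
## [1]  5.0 32.5
# extracts everything except the 1st element
arbitrary_values[-1]
## [1]  7.0 55.6 32.5
# extracts only the elements that are greater than 10
arbitrary_values[arbitrary_values>10]
## [1] 55.6 32.5
```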
Thus far, we have been working with numeric vectors, where each of the vector’s elements is a numeric value, but it is also possible to create vectors in which the elements are strings (i.e. text). Such vectors are known as character vectors. For example, the code below creates a character vector of the first four months of the year, and assigns it to a new object named months_four:
# creates character vector whose elements are the first four months of the year, and assigns the vector to a new object named "months_four"
months_four<-c("January", "February", "March", "April")
Let’s now print the character vector assigned to months_four:
# prints contents of "months_four"
months_four
## [1] "January" "February" "March" "April"
We can extract elements from character vectors using index values in the same way we did for elements in a numeric vector. For example:
# extracts second element of "months_four" object (i.e. the "February" string)
months_four[2]
## [1] "February"
# subsets the second and third elements of "months_four" object (i.e. the "February" and "March" strings, which are extracted as a new character vector)
months_four[2:3]
## [1] "February" "March"
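Arithmetic does not apply to character vectors, but base R provides functions designed for text. The brief sketch below is an illustration rather than part of the workshop code: nchar() counts the characters in each element, toupper() converts text to upper case, and the %in% operator checks whether a value appears in a vector. The vector is recreated so that the example can be run on its own:

```r
# recreates the "months_four" character vector
months_four<-c("January", "February", "March", "April")
# counts the number of characters in each element
nchar(months_four)
## [1] 7 8 5 5
# converts the first element to upper case
toupper(months_four[1])
## [1] "JANUARY"
# checks whether "March" is one of the elements of "months_four"
"March" %in% months_four
## [1] TRUE
```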
The data frame structure is the workhorse of data analysis in R. A data frame resembles a table, of the sort you might generate in a spreadsheet application.
Often, the most important (and arduous) step in a data analysis workflow is to assemble disparate strands of data into a tractable data frame. What does it mean for a data frame to be “tractable”? One way to define this concept more precisely is to appeal to the concept of “tidy” data, which is often referenced in the data science world. Broadly speaking, a “tidy” data frame is a table in which:
+ each variable forms a column;
+ each observation forms a row;
+ each cell contains a single value.
We will work extensively with data frames later in the workshop, but for now, let’s generate a simple data frame from scratch. We will generate a data frame containing “dummy” country-level data on basic economic, geographic, and demographic variables, and assign it to a new object named country_df. The data frame is created using the data.frame() function, which is built into R. Column names and their corresponding values are passed to the data.frame() function in the manner below:
# Creates a dummy country-level data frame
country_df<-data.frame(Country=c("Country A", "Country B", "Country C"),
                       GDP=c(8000, 30000, 23500),
                       Population=c(2000, 5400, 10000),
                       Continent=c("South America", "Europe", "North America"))
country_df
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
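Once a data frame exists, a few base R tools make it easy to inspect and extend. In the sketch below, nrow() and ncol() report the dimensions of the data frame, and the $ operator extracts a single column as a vector. The same operator can also be used to add a column; the GDP_per_capita column here is a hypothetical addition for illustration, computed from the dummy values above (whose units are unspecified):

```r
# recreates the dummy country-level data frame
country_df<-data.frame(Country=c("Country A", "Country B", "Country C"),
                       GDP=c(8000, 30000, 23500),
                       Population=c(2000, 5400, 10000),
                       Continent=c("South America", "Europe", "North America"))
# counts the number of rows (observations) and columns (variables)
nrow(country_df)
## [1] 3
ncol(country_df)
## [1] 4
# extracts the "GDP" column as a numeric vector
country_df$GDP
## [1]  8000 30000 23500
# creates a new (hypothetical) column by dividing "GDP" by "Population"
country_df$GDP_per_capita<-country_df$GDP/country_df$Population
# prints the new column
country_df$GDP_per_capita
## [1] 4.000000 5.555556 2.350000
```

Because each column of a data frame is itself a vector, the division is carried out element by element, in the same way as the vector arithmetic shown earlier.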
view function
identifying data structure
setting working directory